Abstract

The possibility that the sliding motion of proteins on DNA is influenced by the base sequence
through a base pair reading interaction is considered. Referring to the case of the T7
RNA-polymerase, we show that the protein should follow a noise-influenced, sequence-dependent
motion which deviates from the standard random walk usually assumed. The general validity and
the implications of the results are discussed.

PACS numbers: 87.14.Ee, 87.15.Aa, 87.15.Vv 
I. INTRODUCTION



How site-specific DNA binding proteins locate their targets on DNA is an issue of primary
importance for understanding the functioning of DNA. With the development of new experimental
techniques, this problem is receiving much attention; see, e.g., Refs. [1-7]. Sliding, hopping
and uncorrelated three-dimensional diffusion are generally taken into account as possible
search mechanisms, and their relative roles in target location are being discussed and
experimentally investigated. In the seminal work of Berg, Winter and von Hippel (BWH) [8],
one-dimensional diffusion (sliding) along DNA was proposed as a necessary ingredient of the
target search. More recent papers confirm the importance of sliding in the search process,
along with three-dimensional paths (detachment of a protein from DNA and reattachment to a
different segment of DNA) [3].

A completely coherent description of the search process is nevertheless still lacking. In a
recent paper [4], Bruinsma remarks, e.g., that the time spent by the lac repressor on each DNA
site in the frame of the BWH theory is too short to allow the structural changes necessary
for the protein to recognize its target. He thus indicates the need for a slowing-down effect
and suggests that "indirect read-out" mechanisms, associated with the DNA flexibility, can
account for it. Note that the DNA sequence, responsible for the DNA flexibility and shape,
is crucial also for this kind of slowing-down effect.

On the other hand, all existing models of target search dynamics describe the sliding
motion as a standard random walk. In theoretical analyses of experiments it is indeed taken
for granted that the protein motion is governed by linear diffusion, $\langle x^2 \rangle = 2Dt$.
While the linear diffusion assumption is natural for three-dimensional paths (when the protein
is not bound to DNA and diffuses in solution), for the sliding phase of the motion it implies
that the DNA is essentially "seen" by the protein as a homogeneous chain [9]. This homogeneity
of DNA, however, seems incompatible with the recognition function, which always involves a form
of reading, so that it is natural to assume an influence of the DNA sequence on the sliding
dynamics. This influence could result in slowing down, pauses and stops which, in turn,
could invalidate the random walk assumption. These slowing effects can have a different
origin from that suggested by Bruinsma [4]; note, nevertheless, that different mechanisms can
coexist, and that in any case it is the dynamic effects of (direct or indirect) sequence
sensitivity that are considered.
The aim of the present paper is to show that sequence dependence of the DNA-protein
interaction can induce strong deviations from standard diffusion for a generic protein sliding
on DNA. To this end, we use a probabilistic model for the sliding motion of a protein
on DNA in which the influence of the base sequence is accounted for through the DNA-protein
reading interaction [10]. As a result we show that the protein follows a noise-influenced
sequence-dependent motion which deviates from standard diffusion, reaching normal diffusion
only at asymptotically large times. The presence of an anomalous diffusion (AD) regime
speeds up the mobility of a protein, thus greatly facilitating the target search. The crossover
from anomalous to normal diffusion occurs at times typically needed for a protein to cover
the distance over which the potential averages out (of order 100 bp in our model). On the
other hand, indirect measurements hint at a typical mean path length traversed by the
protein during a single DNA binding event of the same order of magnitude (e.g., around
170 bp). Thus, the AD regime should actually dominate the binding phase, and cannot be
neglected.

The paper is organized as follows. In section II we introduce the model using T7 RNA- 
polymerase as a specific example of a sliding protein. In section III we investigate the main 
properties of the sliding dynamics including the sub-diffusive regime and the crossover to 
normal diffusion. In section IV we provide some arguments supporting the generality of our 
results in connection to applications to other enzymes. Finally, in section V, results and 
conclusions of the paper are summarized. 



II. THE MODEL 



A target sequence usually consists of a few (say, $r$) consecutive base pairs (bps). Specific
sequence recognition is often mediated by hydrogen bonds (H-bonds) to a set of four specific,
spatially ordered chemical groups on the major groove side of the bps. Besides this
mechanism, other features of DNA such as shape and flexibility, as well as electrostatic
interactions between protein and DNA, may also be involved in the recognition
process. In this paper, we will focus mainly on the first mechanism, i.e., we assume that
proteins check the sequence at each position on DNA by exploiting the same set of hydrogen
bonds they form with the DNA at the target site. We thus represent the DNA binding
site at position $n$ as a sequence of $r$ vectors $b_n$ (one for each bp), of the form
$B_n = \{b_n, b_{n+1}, \ldots, b_{n+r-1}\}$, according to the rule

$$
b_n = \begin{cases}
(1, -1, 1, 0)^T & \text{for AT}, \\
(0, 1, -1, 1)^T & \text{for TA}, \\
(1, 1, -1, 0)^T & \text{for GC}, \\
(0, -1, 1, 1)^T & \text{for CG},
\end{cases}
\tag{1}
$$

where $+1$, $-1$ and $0$ denote, respectively, an acceptor, a donor, and a missing bond that
each of the four base pairs can form with an external ligand at position $n$ on the DNA [11].
We also assume that the H-bonds formed in the DNA-protein complex at the recognition site are
known (this information can be obtained from crystallographic analysis of the DNA-protein
complex). The protein can then be represented by an ($r \times 4$) recognition matrix $R$
describing the pattern of H-bonds formed by the protein and the DNA at the recognition site.
The protein-DNA interaction energy is then defined by counting the matching and unmatching
bonds between the recognition matrix and the DNA sequence at site $n$,

$$E(n) = -\epsilon\,\mathrm{tr}(R \cdot B_n), \tag{2}$$

where $\epsilon$ denotes each H-bond energy, tr the trace, and the dot refers to usual matrix
multiplication. The DNA is thus viewed as a one-dimensional vector lattice characterized by
a rough on-site potential $E(n)$, on which a random walker (a protein) moves, with rates
(probability per unit time)

$$
r_{n \to n'} = \min\left(\tfrac{1}{2},\ \tfrac{1}{2}\exp(-\beta\,\Delta E_{n \to n'})\right), \qquad
r_{n \to n} = 1 - r_{n \to n+1} - r_{n \to n-1}, \tag{3}
$$

where $n' = n \pm 1$ and $\beta = 1/k_B T$. Time is measured in one-step time units (t.u.). An
estimation for the lower bound of the time unit can be obtained through simple hydrodynamic
considerations 1^Q|, yielding It.u. ~ 10^^ s. The typical H-bond energy is of order of a 
few kcal/mol, but the actual $\epsilon$ could be much smaller due to screening by the
water layer around DNA.
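As an illustration of Eqs. (1) and (2), the following Python sketch encodes each base pair by its H-bond vector and evaluates the energy profile along a sequence. It is a sketch under stated assumptions, not the implementation used for the results of this paper: the helper names (`BP_VECTORS`, `site_energy`, `energy_profile`), the example matrix `R_example` (a hypothetical 2 x 4 matrix, not the T7 recognition matrix) and the sign convention (each matching bond lowers the energy by $\epsilon$, so that a perfect site has the lowest energy) are ours. Only the top strand is given, and the base pair at each position is read as base/complement.

```python
# Sketch of Eqs. (1)-(2): H-bond encoding of base pairs and energy profile.
# Assumptions (not from the paper): helper names, the example matrix R_example,
# and the sign convention E(n) = -eps * tr(R . B_n) (matches lower the energy).

# Eq. (1): +1 acceptor, -1 donor, 0 missing bond, keyed by the top-strand base.
BP_VECTORS = {
    "A": (1, -1, 1, 0),   # AT
    "T": (0, 1, -1, 1),   # TA
    "G": (1, 1, -1, 0),   # GC
    "C": (0, -1, 1, 1),   # CG
}


def site_energy(seq, n, R, eps=1.0):
    """E(n) = -eps * tr(R . B_n) = -eps * sum_i R[i] . b_{n+i}."""
    total = 0.0
    for i, row in enumerate(R):
        b = BP_VECTORS[seq[n + i]]
        total += sum(r_ij * b_j for r_ij, b_j in zip(row, b))
    return -eps * total


def energy_profile(seq, R, eps=1.0):
    """Energies for every position where the r-bp window fits in the sequence."""
    r = len(R)
    return [site_energy(seq, n, R, eps) for n in range(len(seq) - r + 1)]


# Hypothetical recognition matrix reading two bps (G then A), two bonds each.
R_example = [
    (1, 1, 0, 0),    # probes the first two groups of a GC pair
    (1, -1, 0, 0),   # probes the first two groups of an AT pair
]

if __name__ == "__main__":
    profile = energy_profile("TTGACGATT", R_example)
    print(profile)   # the two "GA" windows are the energy minima
```

With this toy matrix the two occurrences of GA in the test sequence are the only sites where all four probed bonds match, which is the sense in which the recognition site is an energy minimum of the landscape.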

The presence of an activation barrier for the translocation on neighboring sites can be 
accounted for by introducing a uniform threshold energy level Et, so that 

$$\Delta E_{n \to n'} = \max\left[E_t - E(n),\ E(n') - E(n),\ 0\right]. \tag{4}$$
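A minimal Python sketch of the hopping rule defined by Eqs. (3) and (4) is given below. The function names and the toy energy list are illustrative assumptions, and $\beta$ is passed directly as `beta`. Since the barrier of Eq. (4) can be written as $\max[E_t, E(n), E(n')] - E(n)$, the forward and backward barriers between two neighbouring sites differ exactly by $E(n') - E(n)$ whatever the threshold, so the threshold slows the dynamics without changing the ratio of forward to backward rates.

```python
import math


def barrier(E, n, n_next, E_t):
    """Eq. (4): activation energy for the move n -> n_next with threshold E_t."""
    return max(E_t - E[n], E[n_next] - E[n], 0.0)


def hop_rates(E, n, E_t, beta=1.0):
    """Eq. (3): (left, stay, right) probabilities per time unit at site n.

    Assumes 0 < n < len(E) - 1, so that both neighbours exist.
    """
    left = min(0.5, 0.5 * math.exp(-beta * barrier(E, n, n - 1, E_t)))
    right = min(0.5, 0.5 * math.exp(-beta * barrier(E, n, n + 1, E_t)))
    return left, 1.0 - left - right, right


if __name__ == "__main__":
    E = [0.0, -2.0, 1.0, -1.0, 0.5]   # toy landscape in units of eps (assumed)
    for E_t in (min(E), 0.0, max(E)):
        print(E_t, hop_rates(E, 2, E_t))
```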

Note that the effective translocation barrier also depends on the position, through the on-site
energy.

Table I: The short-time sub-diffusive parameters $A$ and $b$, fitted in the initial time
interval [0, 100], and those characterizing the asymptotic regime, $D_\infty$ and $b_\infty$,
fitted at large times. The equilibrium diffusion constant $D^*$ is estimated from the mfpt
analysis. All values are obtained for $\beta\epsilon = 1$.

E_t              2A           b            2D_inf               b_inf        2D*
---------------------------------------------------------------------------------------------
E_min            0.82 ± 2%    0.49 ± 1%    4.4 x 10^-3 ± 1%     0.94 ± 1%    4.4 x 10^-3
(intermediate)   0.48 ± 2%    0.56 ± 1%    4.3 x 10^-3 ± 1%     0.93 ± 1%    4.3 x 10^-3
E_max            0.04 ± 3%    0.61 ± 1%    0.25 x 10^-3 ± 2%    0.83 ± 1%    0.2 x 10^-3

As a specific example, we consider the case of the T7 RNA-polymerase sliding on
the bacteriophage T7 DNA. For this case it is known that the recognition site is the five-bp
sequence GAGTC extending from position -11 to -7 in the T7 promoter. The interaction
matrix $R$ can be inferred from the crystallographic studies of Cheetham et al. [16], as

/ 1 1 \ 

1-1 

i?= 1 1 , (5) 
1/2 
\0 1/2 1/ 

where the presence of 1/2 is due to one shared DNA-protein H-bond mediated by a water 
molecule and therefore considered as two half bonds. 



III. THE PROPERTIES: SUBDIFFUSIVITY AND CROSSOVER TO NORMAL 
DIFFUSION 

Theoretically, one can easily calculate the stationary distribution of a population of
proteins on the energy landscape as $P_\infty(n) \propto \exp(-\beta E(n))$, which depends only
on the site energy and on temperature. This implies that the energy minima, which correspond
to the recognition sites, will be on average the most populated. We then calculate the mean
square deviation from the average of the spatial displacement,
$\langle \Delta n^2 \rangle = \langle (n - \langle n \rangle)^2 \rangle$, where the average is
taken over initial positions and different histories (Monte Carlo runs). Three cases have been
examined: $E_t = \min[E(n)] = E_{\min}$, an intermediate value of $E_t$, and
$E_t = \max[E(n)] = E_{\max}$. In the limit $\beta\epsilon = 0$ linear diffusion is recovered,
as one expects (the limiting value $2D = 1$ is obtained in the case $E_t = E_{\min}$, i.e., for
a flat potential without thresholds). Nevertheless, in the finite temperature case, we obtain
large initial deviations from normal diffusion. More precisely, for all thresholds we find that
at the initial stage the diffusion displays anomalous sub-diffusive features, with

$$\langle \Delta n^2 \rangle = 2A\,t^{b}, \qquad b < 1, \tag{6}$$

where $A$ and $b$ depend on the fixed threshold level.

The appearance of the initial subdiffusive regime is not surprising, and has been observed both
for random trap and random barrier potentials, see, e.g., [17]. Our case, however, represents a
mixture of these two cases, for which, to our knowledge, there are no studies of the initial
time behaviour. On the other hand, note that in our model the hopping rates $r_{n \to n+1}$,
$r_{n \to n-1}$ are not random variables but depend on the gradient of the energy landscape,
$\log(r_{n \to n+1}/r_{n+1 \to n}) = -(E_{n+1} - E_n)/(k_B T)$. This has the important
consequence that in the continuous (Langevin) approximation of the process the effective
potential $U$ stays Gaussian localized, with the typical difference
$U(n) - U(n-1) \sim \sqrt{\sigma_E}$ independent of $n$, $\sigma_E$ being the energy variance.
This is different from the Sinai model, where the typical $U(n)$ grows with $n$ as $\sqrt{n}$,
leading to the anomalous $\langle x^2 \rangle \sim (\ln t)^4$ behaviour. Since the Sinai model
is not applicable to our case, we will use in the following a rather crude approximation to
describe the crossover from the initial subdiffusion to the linear diffusion regime. A
quantitative characterization of the initial transient regime is given in Table I for the three
values of $E_t$. The diffusion constant for the three threshold levels is estimated from the
linear fit $\langle \Delta n^2 \rangle = 2Dt$ at large times.
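The procedure just described (Monte Carlo runs averaged over initial positions, followed by a power-law fit of the early-time data as in Eq. (6)) can be sketched in Python as follows. This is an illustrative sketch rather than the code used for Table I: the periodic boundary, the Gaussian toy landscape, the sample sizes and all function names are assumptions made here for the sake of a runnable example.

```python
import math
import random


def simulate_msd(E, E_t, beta, n_walkers, n_steps, rng):
    """Mean square deviation of the displacement versus time (in t.u.).

    E is treated as a periodic landscape (an assumption of this sketch);
    the unwrapped displacement of each walker is tracked separately.
    """
    L = len(E)
    pos = [rng.randrange(L) for _ in range(n_walkers)]   # random initial sites
    disp = [0] * n_walkers
    msd = []
    for _ in range(n_steps):
        for w in range(n_walkers):
            n = pos[w]
            e_here, e_left, e_right = E[n], E[(n - 1) % L], E[(n + 1) % L]
            # Eqs. (3)-(4); the barrier is >= 0, so min(1/2, .) is automatic
            p_left = 0.5 * math.exp(-beta * max(E_t - e_here, e_left - e_here, 0.0))
            p_right = 0.5 * math.exp(-beta * max(E_t - e_here, e_right - e_here, 0.0))
            u = rng.random()
            if u < p_left:
                pos[w], disp[w] = (n - 1) % L, disp[w] - 1
            elif u < p_left + p_right:
                pos[w], disp[w] = (n + 1) % L, disp[w] + 1
        mean = sum(disp) / n_walkers
        msd.append(sum((d - mean) ** 2 for d in disp) / n_walkers)
    return msd


def fit_power_law(msd, t_min, t_max):
    """Least-squares fit of log(msd) = log(2A) + b log(t); returns (2A, b).

    msd[t - 1] is the value after t steps.
    """
    pts = [(math.log(t), math.log(msd[t - 1]))
           for t in range(t_min, t_max + 1) if msd[t - 1] > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    b = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return math.exp(my - b * mx), b


if __name__ == "__main__":
    rng = random.Random(0)
    rough = [rng.gauss(0.0, 2.0) for _ in range(500)]   # toy Gaussian landscape
    msd = simulate_msd(rough, min(rough), 1.0, 1000, 100, rng)
    print("2A, b =", fit_power_law(msd, 5, 100))
```

On a flat landscape the same routine should return an exponent close to 1 and a prefactor close to $2D = 1$, which provides a simple sanity check before a rough landscape is used.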
We checked that an effective linear behaviour is roughly reached by evaluating the parameter
$b_\infty$ in the same range (see Table I). Asymptotically, standard diffusion is recovered (on
the large scale the potential roughness averages to zero). The asymptotic diffusion constant
$D$ decreases for increasing $\beta\epsilon$. The initial deviation from a random walk
$(1 - b)$ and the time needed to reach the asymptotic limit both increase with $\beta\epsilon$;
the typical one-step time (or time unit, t.u.) should be roughly, for real proteins, of the
order of a microsecond, thus giving crossover times up to seconds, corresponding to mean
displacements up to hundreds of bps (data not shown; more details will be given elsewhere). A
theoretical estimate of the large time effective diffusion constant can be obtained from mean
first passage time (mfpt)
analysis.

Figure 1: $2D^* = \langle \Delta n^2 \rangle / T_{n_0}^{n}$ as a function of the adimensional
parameter $\beta\epsilon$ (full lines), and the corresponding $2D$ directly evaluated by fitting
the large time diffusion (symbols), for different values of the threshold energy:
$E_t = E_{\min}$ (open circles), intermediate $E_t$ (triangles) and $E_t = E_{\max}$ (diamonds).
Time is measured in time units (t.u.), see text for details.

According to Ref. [19], for a discrete one-step process such as the one considered here, the
mfpt $T_{n_0}^{n}$ to go from a reference position $n_0$ to a position $n > n_0$ can be
evaluated explicitly once a reflecting barrier is fixed at a position $a < n_0$.
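For a one-step process with up-rates $g_j = r_{j \to j+1}$, down-rates $r_j = r_{j \to j-1}$ and a reflecting barrier at $a$ (so that $r_a = 0$), the standard textbook result for the mfpt is

$$T_{n_0}^{n} = \sum_{\nu = n_0}^{n-1} \sum_{\mu = a}^{\nu} \frac{r_\nu\, r_{\nu-1} \cdots r_{\mu+1}}{g_\nu\, g_{\nu-1} \cdots g_\mu},$$

and we assume that this is the form relevant here (the symbols $g_j$ and $r_j$ are our notation). The Python sketch below evaluates it through the equivalent recursion $\tau_\nu = (1 + r_\nu \tau_{\nu-1})/g_\nu$ for the mean time $\tau_\nu$ needed to step from $\nu$ to $\nu + 1$ for the first time, and cross-checks it against the explicit double sum. Function names, the choice $a = 0$ and the toy landscape are illustrative assumptions.

```python
import math
import random


def step_rates(E, E_t, beta):
    """Up (g) and down (r) hop probabilities per t.u. for sites 0 .. len(E) - 2.

    Site 0 plays the role of the reflecting barrier a = 0: its down-rate is 0.
    """
    g, r = [], []
    for j in range(len(E) - 1):
        g.append(0.5 * math.exp(-beta * max(E_t - E[j], E[j + 1] - E[j], 0.0)))
        if j == 0:
            r.append(0.0)
        else:
            r.append(0.5 * math.exp(-beta * max(E_t - E[j], E[j - 1] - E[j], 0.0)))
    return g, r


def mfpt(g, r, n0, n):
    """Mean first passage time from n0 to n > n0 with reflection at site 0."""
    tau_prev = 0.0
    total = 0.0
    for nu in range(n):
        tau = (1.0 + r[nu] * tau_prev) / g[nu]   # mean time for nu -> nu + 1
        if nu >= n0:
            total += tau
        tau_prev = tau
    return total


def mfpt_double_sum(g, r, n0, n):
    """Same quantity from the explicit double sum, used as a cross-check."""
    total = 0.0
    for nu in range(n0, n):
        for mu in range(0, nu + 1):
            num = math.prod(r[j] for j in range(mu + 1, nu + 1))
            den = math.prod(g[j] for j in range(mu, nu + 1))
            total += num / den
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    E = [rng.gauss(0.0, 1.5) for _ in range(101)]   # toy landscape (assumed)
    g, r = step_rates(E, min(E), 1.0)
    n = 100
    T = mfpt(g, r, 0, n)
    print("T =", T, " 2D* ~", n ** 2 / T)
```

For a flat landscape and $n_0 = a = 0$ the recursion gives $T = n(n+1)$, so that $n^2/T$ tends to 1 for large $n$, consistent with the limiting value $2D = 1$ quoted above.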



Note that $T_{n_0}^{n}$ depends on the threshold level $E_t$ through the hopping rates,
according to Eqs. (3) and (4). For large enough $T_{n_0}^{n}$,

$$\langle \Delta n^2 \rangle \simeq 2D\,T_{n_0}^{n}. \tag{8}$$

Making the choice $a = 0$, the theoretical diffusion constant $D^*$ as a function of
$\beta\epsilon$ can be evaluated using Eq. (8). The result is shown in Fig. 1 together with the
corresponding numerically evaluated diffusion constants. We observe an excellent agreement.
Note that the diffusion constant decreases exponentially for $\beta\epsilon \to \infty$ (in
practice, it is already $\ll 1$ for $\beta\epsilon \sim 1$) and the corresponding mfpt increases
exponentially in the same limit. This behaviour reflects the divergence of the typical extent
of the sub-diffusive transient, which becomes more and more important as $\beta\epsilon$
approaches 1.

The model also allows one to consider the possibility that very unfavorable positions (with a
large number of mismatches) could induce protein conformational changes to the extent of
not allowing the formation of any H-bond, thus inducing a regime of "free sliding". A
threshold energy level should in this case separate reading regions from free sliding regions
(where the DNA is seen as a homogeneous chain). The energy landscape should then be
redefined to be homogeneous above this threshold: we will put $E(n) = E_{fs}$ if $E(n) > E_t$,
and refer to this second possibility as the "two-state model". In this case, the redefinition
of the energy landscape leads to a faster diffusion, even if still sub-diffusive, at small
times. This effect is more evident for low threshold values, i.e., as the energy redefinition
involves an increasing number of sites. Indeed, if a particle (protein) is located on a flat
part of the potential, it will start to diffuse freely with the maximum possible diffusion
constant. Such particles contribute to fast diffusion at initial times. After having slid
freely for a certain time, however, a particle will fall into an $E < E_t$ region, and will be
partially trapped in a potential well. After a transient time, a subdiffusive behaviour similar
to the previous case is indeed reached, which converges, at larger times, to linear diffusion.
A detailed analysis of the "two-state model" will be presented elsewhere.

Figure 2: Dynamics obtained on an artificial Gaussian energy landscape with
$E_{\min} = -N\epsilon$, $E_{\max} \simeq N\epsilon/2$ (solid lines) compared to that obtained
for the T7 RNA-polymerase/DNA interaction (symbols), for energy parameters: $E_t = E_{\min}$,
$\beta\epsilon = 0.5$ (squares); $E_t = E_{\min}$, $\beta\epsilon = 1$ (open circles);
$E_t = E_{\max}$, $\beta\epsilon = 1$ (triangles); "two-state model" with $E_t = 0$,
$\beta\epsilon = 1$ and $E_{fs} = E_{\max}$ (full circles). Time is measured in time units
(t.u.), see text for details.

Thus, one sees a substantial deviation from a random walk during the sliding phase of a target
search. In the next section, we address the question of the generality of the presented
results, in application to larger and more complex proteins such as, e.g., E. coli
RNA-polymerase, lac repressor, EcoRI and EcoRV, i.e., to other H-bond reading enzymes.

IV. GENERALIZATION TO OTHER ENZYMES AND BINDING MECHANISMS 

First of all, note that the dynamics of the proposed model depends only on the obtained
energy profile, and that the most important parameter is the single-bond energy contribution
$\epsilon$, which fixes the energy scale. This quantity, though experimentally difficult to
access, should nevertheless depend only on the nature of the H-bond: one can thus reasonably
expect it to be roughly the same for all proteins. The actual threshold mechanism is also
unknown, but again we could reasonably expect that it depends on general properties of the
protein-DNA interaction, and does not vary in nature from one protein to another.

What should represent the main difference between different proteins is therefore the
length of the recognition sequence [21], or, more precisely, the number of bonds involved in
the reading. This parameter should be adapted in order to mimic the sliding of different
enzymes.

An examination of the whole set of possible hydrogen bonds that DNA bps can form
with external ligands [11] shows that, among the 12 possible H-bond sites exposed on
the 4 different bps, those that are in the central binding sites of the bases ($b_n[2]$ and
$b_n[3]$) can induce both matches and mismatches, while the external ones ($b_n[1]$ and
$b_n[4]$) are either matches or give zero contribution to the interaction energy. It is thus
possible to calculate explicitly the energy level distribution for a generic enzyme looking for
a total of $N$ matches, with $N'$ of them in the two central binding sites of the bases. The
only assumption made is that the matches are uncorrelated, which turns out to be a reasonable
approximation for quasi-random DNA sequences. The resulting energy level distribution is a
convolution of two binomials that rapidly converges to a Gaussian as $N$ and $N'$ increase. It
is then easy to calculate the average and the variance of the energy, which turn out to be
$\langle E \rangle = -(N - N')\epsilon/2$ and $\sigma_E = (N + 3N')\epsilon^2/4$, respectively.
The minimum and maximum energies of the resulting distribution are given by
$E_{\min} = -N\epsilon$, $E_{\max} = N'\epsilon$.
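The convolution of two binomials mentioned above can be written out exactly. The following Python sketch does so under the assumptions stated in its docstring (independent bonds and equiprobable base pairs, so that, from Eq. (1), a central bond is a match or a mismatch with probability 1/2 each and an external bond is a match or gives no contribution with probability 1/2 each); the function names are ours. Energies are in units of $\epsilon$ and matches are counted as negative, in line with $E_{\min} = -N\epsilon$.

```python
from fractions import Fraction
from math import comb


def energy_distribution(N, N_central):
    """Exact distribution of E/eps for N probed bonds, N_central of them central.

    Assumptions of this sketch: every bond is independent; a central bond is a
    match (-1) or a mismatch (+1) with probability 1/2 each; an external bond
    is a match (-1) or gives no contribution (0) with probability 1/2 each.
    Returns a dict {energy in units of eps: probability}.
    """
    N_ext = N - N_central
    dist = {}
    for m_c in range(N_central + 1):          # matches among central bonds
        p_c = Fraction(comb(N_central, m_c), 2 ** N_central)
        for m_e in range(N_ext + 1):          # matches among external bonds
            p_e = Fraction(comb(N_ext, m_e), 2 ** N_ext)
            energy = (N_central - m_c) - m_c - m_e   # mismatches minus matches
            dist[energy] = dist.get(energy, 0) + p_c * p_e
    return dist


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: prob}."""
    mean = sum(e * p for e, p in dist.items())
    var = sum((e - mean) ** 2 * p for e, p in dist.items())
    return mean, var


if __name__ == "__main__":
    N, N_central = 10, 5
    dist = energy_distribution(N, N_central)
    mean, var = mean_and_variance(dist)
    print("mean =", mean, " variance =", var)
    print("E_min =", min(dist), " E_max =", max(dist))
```

Using exact fractions avoids rounding, so the mean $-(N - N')/2$ and the variance $(N + 3N')/4$ (in units of $\epsilon$ and $\epsilon^2$) can be compared with the closed-form expressions directly.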

This leads us to conclude that, for not too small values of $N$ (and $N'$), the energy level
distribution is approximately Gaussian, and its width depends only on $N$ and $N'$ (or,
alternatively, on $E_{\min}$ and $E_{\max}$). Note, furthermore, that if bonds on different
positions are equiprobable, $N'$ should be roughly equal to $N/2$, so that one ends up with only
one parameter. We can therefore expect that the energy landscape for a generic sliding protein,
and hence the sliding motion, depends crucially on the number of H-bonds made at the
recognition site.

We have tested the previous arguments by building an artificial energy profile, with random
levels distributed so as to reproduce the original distribution width and thus the original
Gaussian shape. In Fig. 2, simulations of the protein sliding motion on the basis of this
artificial energy landscape are compared with the previous results for different choices of the
model parameters. Despite a certain arbitrariness in the definition of the artificial energy
landscape, we obtain essentially the same diffusive behaviour as for the true DNA case. In
Fig. 3 we depict the diffusive behaviour for three different values of $N$, with $N' = N/2$:
as easily predicted, the asymptotic normal diffusion slows down when the number of bonds
increases. This parameter thus affects the asymptotic diffusion regime as well as the initial
subdiffusion and the transition time.

Figure 3: Dynamic behaviour, obtained on the artificial Gaussian energy landscape, for $N = 10$
(full circles), $N = 14$ (open circles), $N = 20$ (squares), with $\beta\epsilon = 1$ (upper
curves) or 0.2 (lower curves). Time is measured in time units (t.u.), see text for details.



V. CONCLUSIONS 

In this paper we have considered the sliding motion of a protein on DNA by means of
a probabilistic model which includes the information about the base sequence through the
base pair reading interaction. In the case of the T7 RNA-polymerase we found that the
protein executes a random motion which deviates from the standard random walk dynamics
usually assumed. The presence of an anomalous diffusion regime at the early stages of
the process speeds up the mobility of the protein, facilitating the target search. The overall
diffusive behaviour of the sliding protein can be characterized in terms of few parameters: the
typical interaction energy $\epsilon$ associated with each DNA-protein bond, and the number $N$
of such bonds formed at the recognition site. One can therefore expect the same qualitative
behaviour described here for the example of T7 RNA-polymerase to be valid also for other types
of enzymes (if other kinds of specific chemical bonds intervene in the recognition mechanism,
as e.g. water bridges, minor groove H-bonds or hydrophobic contacts [12], the corresponding
energies should be evaluated and included in the model; nevertheless, the number of specific
bonds is strictly a characteristic of each different enzyme-DNA interaction, and the diffusive
behaviour must still depend on this number).

We finally remark that the presence of additional sequence-dependent interactions in the
recognition process, such as those involving geometrical and elastic characteristics of the
DNA, can also be included in our model. These additional interactions, being sequence specific,
would lead to a redefinition of the energy landscape without much affecting the qualitative
results of the paper (they are, however, much more difficult to model due to the scarcity of
experimental data). In particular, the anomalous diffusion regime discussed above is robust
with respect to changes of the energy landscape. Therefore, the influence of the DNA sequence
on the sliding motion of a protein on DNA makes the standard random walk assumption
for the sliding phase of the target search incorrect for a large set of parameters. This
anomalous diffusive motion should be included in any realistic description of the sliding
component of the target search in order to discriminate the relative roles of 1D sliding and
3D diffusion in the search process.

Acknowledgments

MS and VP wish to acknowledge the Laboratoire de Physique Théorique des Liquides,
Université Paris VI, for hospitality and partial support. MS and MB also acknowledge
hospitality and partial support from the Forschungszentrum Juelich, Germany.



[1] M. Guthold, X. Zhu, C. Rivetti, G. Yang, N. H. Thomson, S. Kasas, H. G. Hansma, B. Smith,
P. K. Hansma, and C. Bustamante, Biophys. J. 77, 2284 (1999).
[2] N. Shimamoto, J. Biol. Chem. 274, 15293 (1999).
[3] U. Gerland, J. Moroz, and T. Hwa, Proc. Natl. Acad. Sci. USA 99, 12015 (2002).
[4] R. F. Bruinsma, Physica A 313, 211 (2002).
[5] N. Stanford, M. Szczelkun, J. Marko, and S. Halford, EMBO J. 19, 6546 (2000).
[6] S. E. Halford and M. D. Szczelkun, Eur. Biophys. J. 31, 257 (2002).
[7] D. M. Gowers and S. E. Halford, EMBO J. 22, 1410 (2003).
[8] O. G. Berg, R. B. Winter, and P. von Hippel, Biochemistry 20, 6929 (1981).
[9] In the context of directed translocation of proteins, the importance of sequence heterogeneity
is recognized, see, e.g., Y. Kafri, D. K. Lubensky, D. R. Nelson, cond-mat/0310455.
[10] M. Barbi, C. Place, V. Popkov, and M. Salerno, J. Biol. Phys. (to appear) (2004).
[11] N. C. Seeman, J. M. Rosenberg, and A. Rich, Proc. Natl. Acad. Sci. USA 73, 804 (1976).
[12] K. Nadassy, S. J. Wodak, and J. Janin, Biochemistry 38, 1999 (1999).
[13] S. G. Kamzolova, V. S. Sivozhelezov, A. A. Sorokin, T. R. Dzhelyadin, N. N. Ivanova, and
R. V. Polozov, J. Biol. Struc. Dyn. 18, 325 (2000).
[14] A. Travers, DNA-Protein Interactions (Chapman and Hall, London, 1993), chap. 3 and 4.
[15] J. M. Schurr, Biophys. Chem. 9, 413 (1979).
[16] G. T. Cheetham, D. Jeruzalmi, and T. A. Steitz, Nature 399, 80 (1999).
[17] J. Haus and K. Kehr, Physics Reports 150, 263 (1987).
[18] J.-P. Bouchaud and A. Georges, Physics Reports 195, 127 (1990).
[19] N. G. Van Kampen, Stochastic Processes in Physics and Chemistry (North-Holland,
Amsterdam, 1981), chap. 11, eq. 2.14.
[20] P. von Hippel, W. A. Rees, K. Rippe, and K. S. Wilson, Biophys. Chem. 59, 231 (1996).

[21] An interesting discussion on the relevance of the number of base-pairs involved in the recog- 
nition mechanism is also presented in Ref. P]. 

[22] Detailed calculations will be presented in a future work.
[23] C. G. Kalodimos, A. M. J. J. Bonvin, A. K. Salinas, R. Wechselberger, R. Boelens, and
R. Kaptein, EMBO J. 21, 2866 (2002).



